Web templates are one of the main development resources for website engineers. Templates allow
them to increase productivity by plugging content into already formatted and prepared pagelets. For
the final user templates are also useful, because they provide uniformity and a common look and
feel for all webpages. However, from the point of view of crawlers and indexers, templates are an
important problem, because templates usually contain irrelevant information such as advertisements,
menus, and banners. Processing and storing this information is likely to lead to a waste of resources
(storage space, bandwidth, etc.). It has been measured that templates represent between 40% and
50% of data on the Web. Therefore, identifying templates is essential for indexing tasks. In this work
we propose a novel method for automatic template extraction that is based on similarity analysis
between the DOM trees of a collection of webpages that are detected using menu information. Our
implementation and experiments demonstrate the usefulness of the technique.


1 Introduction 

A web template (in the following just template) is a prepared HTML page where formatting is already 
implemented and visual components are ready so that we can insert content into them. Templates are 
used as a basis for composing new webpages that share a common look and feel. This is good for 
web development because many tasks can be automated thanks to the reuse of components. In fact, 
many websites are maintained automatically by code generators that generate webpages using templates. 
Templates are also good for users, who can benefit from intuitive and uniform designs with a common
vocabulary of colored and formatted visual elements.

Templates are also important for crawlers and indexers, because they usually judge the relevance of a 
webpage according to the frequency and distribution of terms and hyperlinks. Since templates contain a 
considerable number of common terms and hyperlinks that are replicated in a large number of webpages, 
relevance may turn out to be inaccurate, leading to incorrect results (see, e.g., Ill [m HI). Moreover, in 
general, templates do not contain relevant content, they usually contain one or more pagelets ||5l[Il (i.e., 
self-contained logical regions with a well defined topic or functionality) where the main content must be
inserted. Therefore, detecting templates can allow indexers to identify the main content of the webpage.

Modern crawlers and indexers do not treat all terms in a webpage in the same way. Webpages are
preprocessed to identify the template because template extraction allows them to identify those pagelets
that only contain noisy information such as advertisements and banners. This content should not be
indexed in the same way as the relevant content. Indexing the non-content part of templates not only
affects accuracy, it also affects performance and can lead to a waste of storage space, bandwidth, and
time.

Template extraction helps indexers to isolate the main content. This allows us to enhance indexers
by assigning higher weights to the really relevant terms. Once templates have been extracted, they are
processed for indexing (they can be analyzed only once for all webpages using the same template).
Moreover, links in templates allow indexers to discover the topology of a website (e.g., through
navigational content such as menus), thus identifying the main webpages. They are also essential to compute
pageranks.

Gibson et al. |181 determined that templates represent between 40% and 50% of data on the Web and 
that around 30% of the visible terms and hyperlinks appear in templates. This justifies the importance of 
template removal ifTSl [T6]l for web mining and search. 

Our approach to template extraction is based on the DOM 0 structures that represent webpages. 
Roughly, given a webpage in a website, we first identify a set of webpages that are likely to share a 
template with it, and then, we analyze these webpages to identify the part of their DOM trees that is 
common with the original webpage. This slice of the DOM tree is returned as the template. 

Our technique introduces a new idea to automatically find a set of webpages that potentially share a
template. Roughly, we detect the template’s menu and analyze its links to identify a set of mutually linked
webpages. One of the main functions of a template is in aiding navigation, thus almost all templates
provide a large number of links, shared by all webpages implementing the template. Locating the menu
allows us to identify in the topology of the website the main webpages of each category or section. These
webpages very likely share the same template. This idea is simple but powerful and, contrarily to other
approaches, it allows the technique to only analyze a reduced set of webpages to identify the template.

The rest of the paper has been structured as follows: In Section 2 we discuss the state of the art and
show some problems of current techniques that can be solved with our approach. In Section 3 we provide
some preliminary definitions and useful notation. Then, in Section 4 we present our technique with
examples and explain the algorithms used. In Section 5 we give some details about the implementation
and show the results obtained from a collection of benchmarks. Finally, Section 6 concludes.

2 Related Work 

Template detection and extraction are hot topics due to their direct application to web mining, searching,
indexing, and web development. For this reason, there are many approaches that try to face this problem.
Some of them are specifically designed for boilerplate removal and content extraction, and they have been
presented in the CleanEval competition ||2l, which proposes a collection of examples to be analyzed with
a gold standard.

Content Extraction is a discipline very close to template extraction. Content extraction tries to isolate
the pagelet with the main content of the webpage. It is an instance of a more general discipline called
Block Detection that tries to isolate every pagelet in a webpage. There are many works in these fields
(see, e.g., mEisiiia), and all of them are directly related to template extraction.

In the area of template extraction, there are three different ways to solve the problem, namely, (i)
using the textual information of the webpage (i.e., the HTML code), (ii) using the rendered image of the
webpage in the browser, and (iii) using the DOM tree of the webpage.

The first approach analyzes the plain HTML code, and it is based on the idea that the main content
of the webpage has a higher density of text, with fewer labels. For instance, the main content can be
identified by selecting the largest contiguous text area with the least amount of HTML tags Q. This has been
measured directly on the HTML code by counting the number of characters inside text, and characters
inside labels. This measure produces a ratio called CETR ifTTIl used to discriminate the main content.
Other approaches exploit densitometric features based on the observation that some specific terms are
more common in templates ifT^fTTTl. The distribution of the code between the lines of a webpage is not
necessarily the one expected by the user. The format of the HTML code can be completely unbalanced
(i.e., without tabulations, spaces or even carriage returns), especially when it is generated by a non-human
directed system. As a common example, the reader can see the source code of Google’s main
webpage. At the time of writing these lines, all the code of the webpage is distributed in only a few lines
without any legible structure. For this kind of webpage, CETR is useless.
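
The line-based ratio discussed above can be illustrated with a minimal Python sketch. It is not the CETR implementation: the function name `text_to_tag_ratios`, the regular expression used to recognize tags, and the sample page are assumptions made only for this example. It computes, for each source line, the number of characters of text divided by the number of characters inside tags, and shows why a page delivered on a single line gives the measure nothing to discriminate.

```python
import re

# Assumption: anything between '<' and '>' counts as a tag (deliberately crude).
TAG = re.compile(r"<[^>]*>")


def text_to_tag_ratios(html: str) -> list[float]:
    """For each source line, return (characters of text) / (characters inside tags)."""
    ratios = []
    for line in html.splitlines():
        tag_chars = sum(len(tag) for tag in TAG.findall(line))
        text_chars = len(TAG.sub("", line).strip())
        # max(..., 1) avoids a division by zero on lines without any tag.
        ratios.append(text_chars / max(tag_chars, 1))
    return ratios


# Hypothetical two-line page: a menu line followed by a content line.
page = (
    "<ul><li><a href='/a'>A</a></li></ul>\n"
    "<p>Main content with plenty of plain text.</p>"
)

print(text_to_tag_ratios(page))                    # the menu line scores far lower
print(text_to_tag_ratios(page.replace("\n", "")))  # minified page: a single ratio
```

When the same markup is served without line breaks, only one ratio is produced for the whole page, which is the situation described above for machine-generated code.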

The second approach assumes that the main content of a webpage is often located in the central
part and (at least partially) visible without scrolling. This approach has been less studied because
rendering webpages for classification is a computationally expensive operation lIT^.

The third approach is where our technique falls. While some works try to identify pagelets analyzing 
the DOM tree with heuristics fT], others try to find common subtrees in the DOM trees of a collection of 
webpages in the website ifTSlfT^ . Our technique is similar to these last two works. 

Even though ifTSl uses a method for template extraction, its main goal is to remove redundant parts of 
a website. For this, they use the Site Style Tree (SST), a data structure that is constructed by analyzing a
set of DOM trees and recording every node found, so that repeated nodes are identified by using counters 
in the SST nodes. Hence, an SST summarizes a set of DOM trees. After the SST is built, they have 
information about the repetition of nodes. The most repeated nodes are more likely to belong to a noisy 
part that is removed from the webpages. Unfortunately, this approach does not use any method to detect 
the webpages that share the same template. They are randomly selected, and this can negatively influence 
the performance and the precision of the technique. 

In |[T6ll . the approach is based on discovering optimal mappings between DOM trees. This mapping 
relates nodes that are considered redundant. Their technique uses the RTDM-TD algorithm to compute 
a special kind of mapping called restricted top-down mapping Ifldll . Their objective, as ours, is template 
extraction, but there are two important differences. First, we compute another kind of mapping to identify
redundant nodes. Our mapping is more restrictive because it forces all nodes that form pairs in the
mapping to be equal. Second, in order to select the webpages of the website that should be mapped
to identify the template, they pick random webpages until a threshold is reached. In their experiments,
they approximated this threshold as a few dozen webpages. In our technique, we do not select the
webpages randomly; we use a method to identify the webpages linked by the main menu of the website
because they very likely contain the template. We only need to explore a few webpages to identify the
webpages that implement the template. Moreover, unlike our technique, they assume that all webpages in the
website share the same template, and this is a strong limitation for many websites. 


3 Preliminaries 

The Document Object Model (DOM) @ is an API that provides programmers with a standard set of 
objects for the representation of HTML and XML documents. Our technique is based on the use of
DOM as the model for representing webpages. Given a webpage, it is completely automatic to produce 
its associated DOM structure and vice-versa. In fact, current browsers automatically produce the DOM 
structure of all loaded webpages before they are processed. 

The DOM structure of a given webpage is a tree where all the elements of the webpage are represented
(including scripts and CSS styles) hierarchically. This means that a table that contains another
table is represented with a node with a child that represents the internal table.

In the following, webpages are represented with a DOM tree $T = (N,E)$ where $N$ is a finite set of
nodes and $E$ is a set of edges between nodes in $N$ (see Figure 1). $root(T)$ denotes the root node of $T$.
Given a node $n \in N$, $link(n)$ denotes the hyperlink of $n$ when $n$ is a node that represents a hyperlink
(HTML label <a>). $parent(n)$ represents node $n' \in N$ such that $(n',n) \in E$. Similarly, $children(n)$
represents the set $\{n' \in N \mid (n,n') \in E\}$. $subtree(n)$ denotes the subtree of $T$ whose root is $n \in N$. $path(n)$ is
a non-empty sequence of nodes that represents a DOM path; it can be defined as $path(n) = n_0 n_1 \ldots n_m$
such that $\forall i, 0 \leq i < m .\ n_i = parent(n_{i+1})$.
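
To make this notation concrete, the following Python sketch models a DOM tree with explicit parent and children references. It is only an illustration: the class `Node`, its method `add`, and the sample tree are assumptions introduced for this example and are not taken from our implementation.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Iterator, Optional


@dataclass(eq=False)  # nodes are compared by identity, like DOM nodes
class Node:
    label: str
    href: Optional[str] = None  # plays the role of link(n); only set for <a> nodes
    parent: Optional[Node] = field(default=None, repr=False)
    children: list[Node] = field(default_factory=list)

    def add(self, child: Node) -> Node:
        """Attach child below this node, i.e. add the edge (self, child) to E."""
        child.parent = self
        self.children.append(child)
        return child


def path(n: Node) -> list[Node]:
    """Sequence n_0 ... n_m ending in n, where each n_i is the parent of n_(i+1)."""
    seq = [n]
    while seq[0].parent is not None:
        seq.insert(0, seq[0].parent)
    return seq


def subtree(n: Node) -> Iterator[Node]:
    """All nodes of the subtree rooted at n, in document order."""
    yield n
    for child in n.children:
        yield from subtree(child)


# Hypothetical three-node page: html > body > a
html = Node("html")
body = html.add(Node("body"))
menu_link = body.add(Node("a", href="/news"))

print([n.label for n in path(menu_link)])  # ['html', 'body', 'a']
```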



Figure 1: Equal top-down mapping between DOM trees 

In order to identify the part of the DOM tree that is common in a set of webpages, our technique uses 
an algorithm that is based on the notion of mapping. A mapping establishes a correspondence between 
the nodes of two trees. 

Definition 3.1 (based on Kuo’s definition of mapping i liJI/ ) A mapping from a tree $T = (N,E)$ to a tree
$T' = (N',E')$ is any set $M$ of pairs of nodes $(n,n') \in M$, $n \in N$, $n' \in N'$ such that, for any two pairs $(n_1,n_1')$
and $(n_2,n_2') \in M$, $n_1 = n_2$ iff $n_1' = n_2'$.

In order to identify templates, we are interested in a very specific kind of mapping that we call equal
top-down mapping (ETDM). 

Definition 3.2 Given an equality relation $\triangleq$ between tree nodes, a mapping $M$ between two trees $T$ and
$T'$ is said to be equal top-down if and only if

• equal: for every pair $(n,n') \in M$, $n \triangleq n'$.

• top-down: for every pair $(n,n') \in M$, with $n \neq root(T)$ and $n' \neq root(T')$, there is also a pair
$(parent(n),parent(n')) \in M$.

Note that this definition is parametric with respect to the equality function $\triangleq: N \times N' \to [0..1]$ where
$N$ and $N'$ are sets of nodes of two (often different) DOM trees. We could simply use the standard
equality (=), but we left this relation open, to be general enough as to cover any possible implementation.
In particular, other techniques consider that two nodes $n_1$ and $n_2$ are equal if they have the same label.
However, in our implementation we use a much more complex notion of node equality that uses the label
of the node, its classname, its HTML attributes, its children, its position in the DOM tree, etc.
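
The two conditions of Definition 3.2 can be checked mechanically. The sketch below is illustrative only and rests on several assumptions: nodes are identified by strings, each tree is given as a hypothetical child-to-parent dictionary, and the equality relation is passed in as a Boolean predicate (here, label equality) rather than a function into [0..1].

```python
from typing import Callable, Dict, Optional, Set, Tuple

Pair = Tuple[str, str]


def is_etdm(mapping: Set[Pair],
            parent1: Dict[str, Optional[str]],
            parent2: Dict[str, Optional[str]],
            equal: Callable[[str, str], bool]) -> bool:
    """Check that mapping is a mapping (Definition 3.1) that is equal and top-down."""
    lefts = [n for n, _ in mapping]
    rights = [m for _, m in mapping]
    # Definition 3.1: no node may take part in two different pairs.
    if len(set(lefts)) != len(lefts) or len(set(rights)) != len(rights):
        return False
    for n, m in mapping:
        if not equal(n, m):  # "equal" condition
            return False
        pn, pm = parent1[n], parent2[m]
        # "top-down" condition: the parents of two mapped non-root nodes are mapped too.
        if pn is not None and pm is not None and (pn, pm) not in mapping:
            return False
    return True


# Hypothetical trees: html > body > div, and html > body > {div, table}.
labels = {"h1": "html", "b1": "body", "d1": "div",
          "h2": "html", "b2": "body", "d2": "div", "t2": "table"}
parent1 = {"h1": None, "b1": "h1", "d1": "b1"}
parent2 = {"h2": None, "b2": "h2", "d2": "b2", "t2": "b2"}


def same_label(a: str, b: str) -> bool:
    return labels[a] == labels[b]


print(is_etdm({("h1", "h2"), ("b1", "b2"), ("d1", "d2")}, parent1, parent2, same_label))
print(is_etdm({("h1", "h2"), ("d1", "d2")}, parent1, parent2, same_label))
```

The first mapping satisfies both conditions; the second one maps the two div nodes without mapping their parents, so it is not top-down.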

This definition of mapping allows us to be more restrictive than other mappings such as, e.g., the 
restricted top-down mapping (RTDM) introduced in llTdll . While RTDM permits the mapping of different 
nodes (e.g., a node labelled with table with a node labelled with div), ETDM can force all pairwise
mapped nodes to have the same label. Figure 1 shows an example of an ETDM using: $n \triangleq n'$ if and only
if $n$ and $n'$ have the same label. We can now give a definition of template using ETDM.

Figure 2: Webpages of BBC sharing a template


Definition 3.3 Let $p_0$ be a webpage whose associated DOM tree is $T_0 = (N_0,E_0)$, and let $P = \{p_1 \ldots p_n\}$
be a collection of webpages with associated DOM trees $\{T_1 \ldots T_n\}$. A template of $p_0$ with respect to $P$ is
a tree $(N,E)$ where

• nodes: $N = \{n \in N_0 \mid \forall i, 1 \leq i \leq n .\ (n,\_) \in M_{T_0,T_i}\}$ where $M_{T_0,T_i}$ is an equal top-down mapping
between trees $T_0$ and $T_i$.

• edges: $E = \{(m,m') \in E_0 \mid m,m' \in N\}$.

Hence, the template of a webpage is computed with respect to a set of webpages (usually webpages 
in the same website). We formalize the template as a new webpage computed with an ETDM between 
the initial webpage and all the other webpages. 
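
Read operationally, Definition 3.3 keeps the nodes of the initial webpage that appear in every mapping, together with the edges between them. The following sketch illustrates this under stated assumptions: nodes are plain strings, the mappings are supplied already computed as sets of pairs, and the function name `template_from_mappings` is hypothetical.

```python
from typing import Iterable, Set, Tuple

Edge = Tuple[str, str]


def template_from_mappings(nodes0: Set[str],
                           edges0: Set[Edge],
                           mappings: Iterable[Set[Edge]]) -> Tuple[Set[str], Set[Edge]]:
    """Nodes of T_0 mapped in every M_{T_0,T_i}, plus the edges of T_0 among them."""
    mapped_per_page = [{n for n, _ in m} for m in mappings]
    kept = {n for n in nodes0 if all(n in mapped for mapped in mapped_per_page)}
    edges = {(a, b) for (a, b) in edges0 if a in kept and b in kept}
    return kept, edges


# Hypothetical key page with a menu and a news block, compared with two pages.
nodes0 = {"html", "body", "menu", "news"}
edges0 = {("html", "body"), ("body", "menu"), ("body", "news")}
mappings = [
    {("html", "h"), ("body", "b"), ("menu", "m")},
    {("html", "h2"), ("body", "b2"), ("menu", "m2"), ("news", "n2")},
]

print(template_from_mappings(nodes0, edges0, mappings))
```

The news node is dropped because the first mapping does not cover it, so only the html, body and menu nodes remain in the template.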

4 Template extraction 

Templates are often composed of a set of pagelets. Two of the most important pagelets in a webpage are 
the menu and the main content. For instance, in Figure 2 we see two webpages that belong to the “News”
portal of BBC. At the top of the webpages we find the main menu containing links to all BBC portals. 
We can also see a submenu under the big word “News”. The left webpage belongs to the “Technology” 
section, while the right webpage belongs to the “Science & Environment” section. Both share the same 
menu, submenu, and general structure. In both pages the news are inside the pagelet in the dashed square. 
Note that this pagelet contains the main content and, thus, it should be indexed with a special treatment. 
In addition to the main content, there is a common pagelet called “Top Stories” with the most relevant 
news, and another one called “Features and Analysis”.

Our technique inputs a webpage (called key page) and it outputs its template. To infer the template, 
it analyzes some webpages from the (usually huge) universe of directly or indirectly linked webpages. 
Therefore, we need to decide what concrete webpages should be analyzed. Our approach is very simple 
yet powerful: 

1. Starting from the key page, it identifies a complete subdigraph in the website topology, and then

2. it extracts the template by calculating an ETDM between the DOM tree of the key page and some
of the DOM trees of the webpages in the complete subdigraph.

Both processes are explained in the following sections.

4.1 Finding a complete subdigraph in a website topology

Given a website topology, a complete subdigraph (CS) represents a collection of webpages that are
pairwise mutually linked. An n-complete subdigraph (n-CS) is formed by n nodes. Our interest in complete
subdigraphs comes from the observation that the webpages linked by the items in a menu usually form 
a CS. This is a new way of identifying the webpages that contain the menu. At the same time, these 
webpages are the roots of the sections linked by the menu. The following example illustrates why menus 
provide very useful information about the interconnection of webpages in a given website. 

Example 4.1 Consider the BBC website. Two of its webpages are shown in Figure 2. In this website all
webpages share the same template, and this template has a main menu that is present in all webpages,
and a submenu for each item in the main menu. The site map of the BBC website may be represented
with the topology shown in Figure 3.


Figure 3: BBC Website topology

In this figure, each node represents a webpage and each edge represents a link between two
webpages. Solid edges are bidirectional, and dashed and dotted edges are directed. Black nodes are the
webpages pointed to by the main menu. Because the main menu is present in all webpages, all nodes
are connected to all black nodes (we only draw some of the edges for clarity). Therefore all black nodes
together form a complete graph (i.e., there is an edge between each pair of nodes). Grey nodes are the
webpages pointed to by a submenu; thus, all grey nodes together also form a complete graph. White nodes
are webpages inside one of the categories of the submenu; therefore, all of them have a link to all black
and all grey nodes.

Of course, not all webpages in a website implement the same template, and some of them only 
implement a subset of a template. For this reason, one of the main problems of template extraction is 
deciding what webpages should be analyzed. Minimizing the number of webpages analyzed is essential
to reduce the web crawler's work. In our technique we introduce a new idea to select the webpages that
must be analyzed: we identify a menu in the key page and we analyze the webpages pointed to by this
menu. Observe that we only need to investigate the webpages linked by the key page, because they will
certainly contain a CS that represents the menu.

In order to increase precision, we search for a CS that contains enough webpages that implement the 
template. This CS can be identified with Algorithm 1.

Algorithm 1 Extract an n-CS from a website

Input: An initialLink that points to a webpage and the expected size n of the CS.
Output: A set of links to webpages that together form an n-CS.
        If an n-CS cannot be formed, then they form the biggest m-CS with m < n.

begin
  keyPage ← loadWebPage(initialLink);
  reachableLinks ← getLinks(keyPage);
  processedLinks ← ∅;
  connections ← ∅;
  bestCS ← ∅;
  foreach link in reachableLinks
    webPage ← loadWebPage(link);
    existingLinks ← getLinks(webPage) ∩ reachableLinks;
    processedLinks ← processedLinks ∪ {link};
    connections ← connections ∪ {(link → existingLink) | existingLink ∈ existingLinks};
    CS ← {ls ∈ 𝒫(processedLinks) | link ∈ ls ∧ ∀ l, l' ∈ ls . (l → l'), (l' → l) ∈ connections};
    maximalCS ← cs ∈ CS such that ∀ cs' ∈ CS . |cs| ≥ |cs'|;
    if |maximalCS| = n then return maximalCS;
    if |maximalCS| > |bestCS| then bestCS ← maximalCS;
  return bestCS;
end

This algorithm inputs a webpage and the size n of the CS to be computed. We have empirically
approximated the optimal value for n, which is 4. The algorithm uses two trivial functions:
loadWebPage(link), which loads and returns the webpage pointed to by the input link, and getLinks(webpage),
which returns the collection of (non-repeated) links* in the input webpage (ignoring self-links). Observe
that the main loop iteratively explores the links of the webpage pointed to by the initialLink (i.e., the key
page) until it finds an n-CS. Note also that it only loads those webpages needed to find the n-CS, and it
stops when the n-CS has been found. We want to highlight the mathematical expression

$$CS \leftarrow \{ls \in \mathcal{P}(processedLinks) \mid link \in ls \wedge \forall l, l' \in ls .\ (l \to l'), (l' \to l) \in connections\},$$

where $\mathcal{P}(X)$ returns all possible subsets of set $X$ (i.e., its power set).

It is used to find the set of all CS that can be constructed with the current link. Here, processedLinks
contains the set of links that have already been explored by the algorithm, and connections is the set
of all links between the webpages pointed to by processedLinks. connections is used to identify the CS.
Therefore, the set CS is composed of the subsets of processedLinks that form a CS using connections.

Observe that the current link must be part of the CS ($link \in ls$) to ensure that we make progress (not
repeating the same search of the previous iteration). Moreover, because the CS is constructed
incrementally, the statement

if |maximalCS| = n then return maximalCS

ensures that whenever an n-CS can be formed, it is returned.

4.2 Template extraction from a complete subdigraph 

After we have found a set of webpages mutually linked and linked by the menu of the site (the complete 
subdigraph), we identify an ETDM between the key page and all webpages in the set. For this, initially,
the template is considered to be the key page. Then, we compute an ETDM between the template and
one webpage in the set. The result is the new refined template, which is further refined with another ETDM
with another webpage, and so on until all webpages have been processed. This process is formalized in
Algorithm 2, which uses function ETDM to compute the biggest ETDM between two trees.

*In our implementation, this function removes those links that point to other domains because they are very unlikely to
contain the same template.

Algorithm 2 Extract a template from a set of webpages

Input: A key page pk = (N1, E1) and a set of n webpages P.
Output: A template for pk with respect to P.

begin
  template ← pk;
  foreach (p in P)
    if root(pk) ≜ root(p)
      template ← ETDM(template, p);
  return template;
end

function ETDM(tree T1 = (N1, E1), tree T2 = (N2, E2))
  r1 ← root(T1);
  r2 ← root(T2);
  nodes ← {r1};
  edges ← ∅;
  foreach n1 ∈ N1 \ nodes, n2 ∈ N2 \ nodes . n1 ≜max n2, (r1, n1) ∈ E1 and (r2, n2) ∈ E2
    (nodes_st, edges_st) ← ETDM(subtree(n1), subtree(n2));
    nodes ← nodes ∪ nodes_st;
    edges ← edges ∪ edges_st ∪ {(r1, n1)};
  return (nodes, edges);
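
A possible reading of Algorithm 2 in Python is sketched below. Several choices are assumptions made for this illustration rather than details of our implementation: trees are nested `Tree` objects, the equality function is a similarity in [0..1] (by default, label equality), two nodes are considered equal when their similarity exceeds a hypothetical `threshold`, and children are paired greedily with their most similar unmatched counterpart.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List


@dataclass
class Tree:
    label: str
    children: List["Tree"] = field(default_factory=list)


Similarity = Callable[[Tree, Tree], float]


def same_label(a: Tree, b: Tree) -> float:
    """Simplest equality function: 1.0 for identical labels, 0.0 otherwise."""
    return 1.0 if a.label == b.label else 0.0


def etdm(t1: Tree, t2: Tree, sim: Similarity = same_label, threshold: float = 0.5) -> Tree:
    """Part of t1 that is mapped top-down onto t2 (the roots are assumed equal)."""
    kept = Tree(t1.label)
    free = list(t2.children)  # children of t2 that are not mapped yet
    for c1 in t1.children:
        if not free:
            break
        # Greedy choice: the unmatched child of t2 that maximizes the similarity.
        i = max(range(len(free)), key=lambda j: sim(c1, free[j]))
        if sim(c1, free[i]) > threshold:
            kept.children.append(etdm(c1, free.pop(i), sim, threshold))
    return kept


def extract_template(key_page: Tree, pages: Iterable[Tree],
                     sim: Similarity = same_label, threshold: float = 0.5) -> Tree:
    """Refine the template with one ETDM per webpage whose root equals the key page root."""
    template = key_page
    for p in pages:
        if sim(key_page, p) > threshold:
            template = etdm(template, p, sim, threshold)
    return template


# Hypothetical pages sharing menu and footer but not the content of the div.
key = Tree("html", [Tree("body", [Tree("nav", [Tree("a"), Tree("a")]),
                                  Tree("div", [Tree("p"), Tree("p")]),
                                  Tree("footer")])])
other = Tree("html", [Tree("body", [Tree("nav", [Tree("a"), Tree("a")]),
                                    Tree("div", [Tree("table")]),
                                    Tree("footer")])])

print(extract_template(key, [other]))
```

In the result, the nav and footer blocks survive completely, while the div is kept without its paragraphs because the other page has no equal children at that position.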


As in Definition 3.2, we left the algorithm parametric with respect to the equality function $\triangleq$. This
is done on purpose, because this relation is the only parameter that is subjective and thus, it is a good
design decision to leave it open. For instance, a researcher can decide that two DOM nodes are equal
if they have the same label and attributes. Another researcher can relax this restriction ignoring some
attributes (i.e., the template can be the same, even if there are differences in colors, sizes, or even positions
of elements; it usually depends on the particular use of the extracted template). Clearly, $\triangleq$ has a direct
influence on the precision and recall of the technique. The more restrictive, the more precision (and less
recall). Note also that the algorithm uses $n_1 \triangleq_{max} n_2$ to indicate that $n_1$ and $n_2$ maximize function $\triangleq$.


In our implementation, function $\triangleq$ is defined with a weighting that compares two nodes considering
their classname, their number of children, their relative position in the DOM tree, and their HTML
attributes. We refer the interested reader to our open and free implementation
(http://www.dsic.upv.es/~jsilva/retrieval/templates/) where function $\triangleq$ is specified.
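
As a rough illustration of what such a weighted equality function may look like, consider the following Python sketch. It does not reproduce the function of our implementation: the class `DomNode`, the use of depth as a stand-in for the relative position, the individual scores, and the weights in `WEIGHTS` are all hypothetical and were chosen only so that the result lies in [0..1].

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class DomNode:
    tag: str
    class_name: str = ""
    attributes: dict[str, str] = field(default_factory=dict)
    n_children: int = 0
    depth: int = 0  # assumption: depth stands in for the relative position


# Hypothetical weights; they add up to 1 so that the similarity lies in [0..1].
WEIGHTS = {"class": 0.5, "children": 0.25, "position": 0.125, "attributes": 0.125}


def node_similarity(a: DomNode, b: DomNode, weights: dict[str, float] = WEIGHTS) -> float:
    """Weighted comparison of two nodes; nodes with different labels are never equal."""
    if a.tag != b.tag:
        return 0.0
    class_score = 1.0 if a.class_name == b.class_name else 0.0
    biggest = max(a.n_children, b.n_children)
    children_score = 1.0 if biggest == 0 else min(a.n_children, b.n_children) / biggest
    position_score = 1.0 if a.depth == b.depth else 0.0
    keys = set(a.attributes) | set(b.attributes)
    if keys:
        same = sum(a.attributes.get(k) == b.attributes.get(k) for k in keys)
        attributes_score = same / len(keys)
    else:
        attributes_score = 1.0
    return (weights["class"] * class_score
            + weights["children"] * children_score
            + weights["position"] * position_score
            + weights["attributes"] * attributes_score)


menu_a = DomNode("div", "menu", {"id": "nav"}, n_children=5, depth=2)
menu_b = DomNode("div", "menu", {"id": "nav"}, n_children=4, depth=2)
print(node_similarity(menu_a, menu_b))
```

A stricter or more relaxed notion of equality is then obtained by changing the weights or the threshold applied to the returned value.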


5 Implementation 

The technique presented in this paper, including all the algorithms, has been implemented as a Firefox
plugin. In this tool, the user can browse the Internet as usual. Then, when they want to extract the
template of a webpage, they only need to press the “Extract Template” button and the tool
automatically loads the appropriate linked webpages to form a CS, analyzes them, and extracts the template. The
template is then displayed in the browser as any other webpage. For instance, the template extracted for
the webpages in Figure 2 contains the whole webpage except for the part inside the dashed box.

Our implementation and all the experimentation are public. All the information of the experiments,
the source code of the benchmarks, the source code of the tool, and other material can be found at: 


http://www.dsic.upv.es/~jsilva/retrieval/templates/

6 Conclusions

Web templates are an important tool for website developers. By automatically inserting content into
web templates, website developers and content providers of large web portals achieve high levels of
productivity, and they produce webpages that are more usable thanks to their uniformity.

This work presents a new technique for template extraction. The technique is useful for website 
developers because they can automatically extract a clean HTML template of any webpage. This is 
particularly interesting to reuse components of other webpages. Moreover, the technique can be used 
by other systems and tools such as indexers or wrappers as a preliminary stage. Extracting the template 
allows them to identify the structure of the webpage and the topology of the website by analyzing the 
navigational information of the template. In addition, the template is useful to identify pagelets, repeated 
advertisement panels, and what is particularly important, the main content. 

Our technique uses the menus of a website to identify a set of webpages that share the same template 
with a high probability. Then, it uses the DOM structure of the webpages to identify the blocks that are 
common to all of them. These blocks together form the template. To the best of our knowledge, the 
idea of using the menus to locate the template is new, and it allows us to quickly find a set of webpages 
from which we can extract the template. This is especially interesting for performance, because loading 
webpages to be analyzed is expensive, and this part of the process is minimized in our technique. On
average, our technique only loads 7 pages to extract the template.

This technique could be also used for content extraction. Detecting the template of a webpage is very 
helpful to detect the main content. Firstly, the main content must be formed by DOM nodes that do not 
belong to the template. Secondly, the main content is usually inside one of the pagelets that are more 
centered and visible, and with a higher concentration of text. 

For future work, we plan to investigate a strategy to further reduce the number of webpages loaded
with our technique. The idea is to directly identify the menu in the key page by measuring the density
of links in its DOM tree. The menu probably has one of the highest densities of links in a webpage.
Therefore, our technique could benefit from measuring the links-DOM nodes ratio to directly find the
menu in the key page, and thus, a complete subdigraph in the website topology.
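
The links-DOM nodes ratio mentioned above could be measured along the following lines. This is a speculative sketch of the idea, not an evaluated method: the class `Elem`, the function names, and the minimum number of links `min_links` (used to discard single hyperlinks, whose ratio is trivially 1) are assumptions introduced for this example.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Elem:
    tag: str
    children: List["Elem"] = field(default_factory=list)


def counts(e: Elem) -> Tuple[int, int]:
    """Return (number of <a> nodes, total number of nodes) in the subtree rooted at e."""
    links = 1 if e.tag == "a" else 0
    nodes = 1
    for child in e.children:
        child_links, child_nodes = counts(child)
        links += child_links
        nodes += child_nodes
    return links, nodes


def densest_link_block(root: Elem, min_links: int = 3) -> Optional[Elem]:
    """Subtree with the highest links/nodes ratio among those with at least min_links links."""
    best, best_ratio = None, 0.0
    stack = [root]
    while stack:
        e = stack.pop()
        links, nodes = counts(e)
        if links >= min_links and links / nodes > best_ratio:
            best, best_ratio = e, links / nodes
        stack.extend(e.children)
    return best


# Hypothetical page: a three-item menu and a content block with a single link.
menu = Elem("ul", [Elem("li", [Elem("a")]) for _ in range(3)])
content = Elem("div", [Elem("p"), Elem("p"), Elem("a")])
page = Elem("html", [Elem("body", [menu, content])])

print(densest_link_block(page).tag)  # ul
```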